I am incredibly excited that RStudio has begun an instructor certification program based on the Carpentries, so of course I signed up as soon as my overcommitted nature allowed! This also gives me the excuse and motivation to finally work my way through R for Data Science properly. It is a book I read while waiting for GTT tests during my pregnancy, and one I have google-landed upon umpteen times while debugging code, but I never took the time to sit down and do the exercises. The pedagogue in me knows quite well that THAT is how you actually learn and internalise the principles and concepts in any material, especially if it deals with programming and analysis. So over the next few weeks I plan to work my way through R4DS, and this post is the first in which I dive into the exercises.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.2.1 ✓ purrr 0.3.3
## ✓ tibble 2.1.3 ✓ dplyr 0.8.3
## ✓ tidyr 1.0.0 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.4.0
## ── Conflicts ────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
# devtools::install_github("thomasp85/patchwork")
library(patchwork)
Tidying - storing data in a consistent form that matches the semantics of the dataset with the way it is stored. In brief, when your data is tidy, each column is a variable, and each row is an observation.
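To make the tidy definition concrete, here is a minimal sketch using tidyr. The `untidy` table and its numbers are hypothetical, made up purely for illustration; the point is only that the year variable hides in the column names until we pivot.

```r
library(tibble)
library(tidyr)

# Hypothetical untidy table: the "year" variable is spread across column names
untidy <- tribble(
  ~country, ~`1999`, ~`2000`,
  "A",      745,     2666,
  "B",      37737,   80488
)

# Tidy form: one column per variable (country, year, cases),
# one row per observation (a country in a given year)
tidy <- pivot_longer(untidy, cols = c(`1999`, `2000`),
                     names_to = "year", values_to = "cases")
tidy
```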
Sampling may be enough to answer the question.
ggplot() aesthetics: stroke is either the size of the point (for a default geom_point()) OR, if used with shapes 21-25, which have both a colour and a fill, the thickness of the stroke around the plotted shape.
You can generally use geoms and stats interchangeably! For example, you can use stat_count() instead of geom_bar() to make the same plot!
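A quick way to convince yourself of this is to build the same bar chart both ways and compare what each layer computes. This is just a sketch; `p_geom`, `p_stat` and the `counts_*` names are my own, and `layer_data()` is the ggplot2 helper that returns the data a layer actually draws.

```r
library(ggplot2)

# Same bar chart built two ways: from the geom and from its default stat
p_geom <- ggplot(diamonds) + geom_bar(aes(x = cut))
p_stat <- ggplot(diamonds) + stat_count(aes(x = cut))

# layer_data() returns the computed values each layer draws
counts_geom <- layer_data(p_geom)$count
counts_stat <- layer_data(p_stat)$count
identical(counts_geom, counts_stat)
```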
ggplot(data = diamonds) + geom_bar(aes(x = cut, y = ..count.. / sum(..count..), fill = color))
# not really new, but I'm sure I'll forget position = "fill"
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = clarity), position = "fill")
# pie chart from bar
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = 1, fill = clarity)) + coord_polar(theta = "y")
On average, humans are best able to perceive differences in angles relative to 45 degrees. The function ggthemes::bank_slopes() will calculate the optimal aspect ratio to bank slopes to 45-degrees.
| geom | default stat | shared docs |
|---|---|---|
| geom_abline() | ||
| geom_hline() | ||
| geom_vline() | ||
| geom_bar() | stat_count() | x |
| geom_col() | ||
| geom_bin2d() | stat_bin_2d() | x |
| geom_blank() | ||
| geom_boxplot() | stat_boxplot() | x |
| geom_contour() | stat_contour() | x |
| geom_count() | stat_sum() | x |
| geom_density() | stat_density() | x |
| geom_density_2d() | stat_density_2d() | x |
| geom_dotplot() | ||
| geom_errorbarh() | ||
| geom_hex() | stat_bin_hex() | x |
| geom_freqpoly() | stat_bin() | x |
| geom_histogram() | stat_bin() | x |
| geom_crossbar() | ||
| geom_errorbar() | ||
| geom_linerange() | ||
| geom_pointrange() | ||
| geom_map() | ||
| geom_path() | ||
| geom_line() | ||
| geom_step() | ||
| geom_point() | ||
| geom_polygon() | ||
| geom_qq_line() | stat_qq_line() | x |
| geom_qq() | stat_qq() | x |
| geom_quantile() | stat_quantile() | x |
| geom_ribbon() | ||
| geom_area() | ||
| geom_rug() | ||
| geom_smooth() | stat_smooth() | x |
| geom_spoke() | ||
| geom_label() | ||
| geom_text() | ||
| geom_raster() | ||
| geom_rect() | ||
| geom_tile() | ||
| geom_violin() | stat_ydensity() | x |
| geom_sf() | stat_sf() | x |
| stat | default geom | shared docs |
|---|---|---|
| stat_ecdf() | geom_step() | |
| stat_ellipse() | geom_path() | |
| stat_function() | geom_path() | |
| stat_identity() | geom_point() | |
| stat_summary_2d() | geom_tile() | |
| stat_summary_hex() | geom_hex() | |
| stat_summary_bin() | geom_pointrange() | |
| stat_summary() | geom_pointrange() | |
| stat_unique() | geom_point() | |
| stat_count() | geom_bar() | x |
| stat_bin_2d() | geom_tile() | x |
| stat_boxplot() | geom_boxplot() | x |
| stat_contour() | geom_contour() | x |
| stat_sum() | geom_point() | x |
| stat_density() | geom_area() | x |
| stat_density_2d() | geom_density_2d() | x |
| stat_bin_hex() | geom_hex() | x |
| stat_bin() | geom_bar() | x |
| stat_qq_line() | geom_path() | x |
| stat_qq() | geom_point() | x |
| stat_quantile() | geom_quantile() | x |
| stat_smooth() | geom_smooth() | x |
| stat_ydensity() | geom_violin() | x |
| stat_sf() | geom_rect() | x |
## Data Viz Exercises
ggplot(data = mpg)
Nothing but an empty canvas, because we haven’t added a geom layer.
nrow(mpg)
## [1] 234
ncol(mpg)
## [1] 11
dim(mpg)
## [1] 234 11
?mpg
4. Make a scatterplot of hwy vs cyl.
# set all ggplot figures to use the classic theme
theme_set(theme_classic())
mpg %>%
ggplot(aes(x = cyl, y = hwy)) + geom_point()
# use transparency and jitter to make the points separate better
mpg %>%
ggplot(aes(x = cyl, y = hwy)) + geom_jitter(width = 0.4, alpha = 0.5)
theme_set(theme_minimal())
mpg %>%
ggplot(aes(x = class, y = drv)) + geom_point()
Because it plots one categorical variable against another, many observations land on the same point and most of the space in the plot cannot be filled. However, I’d argue that it’s not completely useless, as it does show that all 2 seater cars have rear-wheel drive, while all minivans have front-wheel drive.
The barplot below probably presents a better visualisation, as it also shows that we may not have sampled enough 2 seater vehicles to identify whether any of them could possibly have front-wheel drive. Having said that, I do like the dot-plot visualisation as well.
theme_set(theme_minimal())
mpg %>%
ggplot(aes(x = class, fill = drv)) + geom_bar()
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))
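Because color has been set inside aes(), ggplot assumes we want to map the colour aesthetic to a variable whose only value is the string "blue", so the points get the first colour of the default palette (plus a legend) rather than being drawn in blue. To fix, set color outside aes():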
ggplot(data = mpg) +
geom_point(mapping = aes(x = displ, y = hwy), color = "blue")
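To see the difference between mapping and setting without squinting at the plot, we can inspect the colours each layer ends up using. A small sketch (the names `p_mapped` and `p_set` are mine):

```r
library(ggplot2)

# Inside aes(): "blue" is treated as a constant data value
# and mapped through the colour scale
p_mapped <- ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy, colour = "blue"))

# Outside aes(): "blue" is used directly as the colour of every point
p_set <- ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy), colour = "blue")

unique(layer_data(p_mapped)$colour)  # a palette colour, not "blue"
unique(layer_data(p_set)$colour)     # "blue"
```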
glimpse(mpg)
## Observations: 234
## Variables: 11
## $ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi", "audi", …
## $ model <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 quattro", …
## $ displ <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2.0, 2…
## $ year <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999, 20…
## $ cyl <int> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8, 8,…
## $ trans <chr> "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)", "aut…
## $ drv <chr> "f", "f", "f", "f", "f", "f", "f", "4", "4", "4", "4", "…
## $ cty <int> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17, 17, …
## $ hwy <int> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25, 25, …
## $ fl <chr> "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "…
## $ class <chr> "compact", "compact", "compact", "compact", "compact", "…
We can use the glimpse() command, which will show us the type of each variable. Those that are ‘chr’ (character) are categorical, whereas those that are ‘int’ (integer) or ‘dbl’ (double) are continuous.
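If you would rather not read the types off the glimpse() output by eye, the same split can be done programmatically. A minimal sketch, assuming (as above) that character columns are the categorical ones; `categorical_vars` and `numeric_vars` are just names I picked:

```r
library(ggplot2)  # provides the mpg dataset

# Flag the character columns, then split the column names by type
is_chr <- vapply(mpg, is.character, logical(1))
categorical_vars <- names(mpg)[is_chr]
numeric_vars <- names(mpg)[!is_chr]

categorical_vars
numeric_vars
```

Note that this only reflects how the data is stored: cyl and year are integers, but you may still want to treat them as categories in a plot.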
mpg %>% ggplot(aes(x = displ, y = hwy, col = hwy)) + geom_point()
For a continuous variable, colour is used to represent a gradient.
mpg %>% ggplot(aes(x = displ, y = hwy, size = hwy)) + geom_point()
The points get bigger as the values get bigger.
# mpg %>% ggplot(aes(x = displ, y = hwy, shape = hwy)) + geom_point()
Shape gives an error, because a continuous variable cannot be mapped to shape.
mpg %>% ggplot(aes(x = displ, y = hwy, col = manufacturer)) + geom_point()
Colour colours the points by the levels of the category.
mpg %>% ggplot(aes(x = displ, y = hwy, size = manufacturer)) + geom_point()
## Warning: Using size for a discrete variable is not advised.
Size is not advised, but still works.
# mpg %>% ggplot(aes(x = displ, y = hwy, shape = manufacturer)) + geom_point()
# With the default scale, ggplot2 warns that the shape palette can deal with at most 6 discrete values, and the remaining manufacturers are not plotted
mpg %>% ggplot(aes(x = displ, y = hwy, shape = manufacturer)) + geom_point() + scale_shape_manual(values=1:length(unique(mpg$manufacturer)))
By default, shape cannot represent more than 6 levels, but it can be coerced into presenting more by supplying the shapes ourselves with scale_shape_manual().
It gets mapped!
mpg %>% ggplot(aes(x = displ, y = hwy, shape = manufacturer, col=manufacturer)) + geom_point() + scale_shape_manual(values=1:length(unique(mpg$manufacturer)))
mpg %>% ggplot(aes(x = displ, y = hwy, stroke = displ)) + geom_point()
Increases the thickness of the stroke as values of the variable get larger.
mpg %>% ggplot(aes(x = displ, y = hwy, colour = displ < 5)) + geom_point()
The expression will be evaluated for every row, and the variable plotted will be the resulting TRUE/FALSE value of (displ < 5).
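What ggplot maps here is an ordinary logical vector, which we can compute by hand to see what the two colour groups contain. A tiny sketch (`small_engine` is my own name):

```r
library(ggplot2)  # provides the mpg dataset

# The expression is evaluated once per row, giving one TRUE/FALSE per car
small_engine <- mpg$displ < 5

# How many cars fall into each of the two colour groups
table(small_engine)
```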
mpg %>% ggplot(aes(x = displ, y = hwy)) + geom_point() + facet_grid(~hwy)
It treats the continuous variable as categorical, with one panel per distinct value, so bad things!
ggplot(data = mpg) + geom_point(mapping = aes(x = drv, y = cyl))
There are no cars with cyl == 7, none with drv == "r" and cyl == 4, and none with cyl == 5 and either drv == "4" or drv == "r".
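The empty positions can also be listed directly rather than read off the plot. This is a sketch using dplyr and tidyr; `cell_counts` and `empty_cells` are names I made up, and complete() is used to add the drv/cyl combinations that never occur in the data.

```r
library(ggplot2)  # provides the mpg dataset
library(dplyr)
library(tidyr)

# Count cars in every drv x cyl combination,
# filling combinations that never occur with 0
cell_counts <- mpg %>%
  count(drv, cyl) %>%
  complete(drv, cyl, fill = list(n = 0L))

# Combinations with no cars correspond to the empty positions in the plot
empty_cells <- filter(cell_counts, n == 0)
empty_cells
```

Since cyl == 7 never appears in the data at all, it will not show up in `empty_cells`; it is only "empty" on the continuous y axis of the scatterplot.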
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy)) + facet_grid(drv ~ .)
The . says not to facet on that dimension.
Take the first faceted plot in this section. What are the advantages to using faceting instead of the colour aesthetic? What are the disadvantages? How might the balance change if you had a larger dataset?
ggplot(data = mpg) +geom_point(mapping = aes(x = displ, y = hwy)) + facet_wrap(~ class, nrow = 2)
It is cleaner to see the trend within each level of class. If we had a larger dataset this would be more important, as overlaying all of the points would create a data blob instead of a meaningful visualisation.
5. Read ?facet_wrap. What does nrow do? What does ncol do? What other options control the layout of the individual panels? Why doesn’t facet_grid() have nrow and ncol arguments?
nrow and ncol specify how many rows and columns we want our panels to be split into. facet_grid() doesn’t need them, as the number of rows and columns is determined automatically by the number of levels in the variables we’re faceting by.
scales is very useful as it allows us to have free scales (i.e. different scales) for each of our individual plots.
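As a quick illustration of the scales argument, here is a sketch that frees only the y axis, so each class panel zooms in on its own hwy range while displ stays comparable across panels (`p_free` is my own name):

```r
library(ggplot2)

# Each panel gets its own y-axis range; the x axis stays shared
p_free <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_wrap(~ class, nrow = 2, scales = "free_y")
p_free
```

The other accepted values are "fixed" (the default), "free_x" and "free". The trade-off is that freed axes make it harder to compare values between panels.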
Because that allows us to better see the spread of the data.
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) + geom_point() + geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Hides the legend for that geom layer. Note that if you want to hide the legend completely, you need to include it in each geom layer we present, so geom_point(show.legend = FALSE) and geom_smooth(show.legend = FALSE) for the plot above.
Specifies whether to draw the confidence band (based on the standard error) around the smoothed line.
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_point() +
geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
ggplot() +
geom_point(data = mpg, mapping = aes(x = displ, y = hwy)) +
geom_smooth(data = mpg, mapping = aes(x = displ, y = hwy))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
No, they should be identical, because they specify the same data and the same x/y aesthetics for both geoms.
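We can check this instead of trusting our eyes by comparing the data each version computes for the smooth layer. A sketch, with `p_global` and `p_local` as my own names for the two versions:

```r
library(ggplot2)

# Data and mapping declared once, inherited by both layers
p_global <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth()

# Same data and mapping repeated inside each layer
p_local <- ggplot() +
  geom_point(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_smooth(data = mpg, mapping = aes(x = displ, y = hwy))

# Compare the computed smooth (layer 2) of both plots
smooth_global <- layer_data(p_global, 2)
smooth_local <- layer_data(p_local, 2)
all.equal(smooth_global, smooth_local)
```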
I code the six plots to variables first, and then use the patchwork library to present them in one figure below:
one <- mpg %>% ggplot(aes(x = displ, y = hwy)) + geom_point() + geom_smooth(se = FALSE)
# group = drv draws one smooth per drive type without changing its colour
two <- mpg %>% ggplot(aes(x = displ, y = hwy)) + geom_point() + geom_smooth(aes(group = drv), se = FALSE, show.legend = FALSE)
three <- mpg %>% ggplot(aes(x = displ, y = hwy, col = drv)) + geom_point() + geom_smooth(se = FALSE)
four <- mpg %>% ggplot() + geom_point(aes(x = displ, y = hwy, col = drv)) + geom_smooth(aes(x = displ, y = hwy), se = FALSE)
five <- mpg %>% ggplot(aes(x = displ, y = hwy, col = drv, linetype = drv)) + geom_point() + geom_smooth(se = FALSE)
six <- mpg %>% ggplot(aes(x = displ, y = hwy, fill = drv)) + geom_point(shape = 21, col = "white", stroke = 2, size = 3) + theme_gray()
one + two + three + four + five + six + plot_layout(ncol = 2)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
1. What is the default geom associated with stat_summary()? How could you rewrite the previous plot to use that geom function instead of the stat function?
# original
ggplot(data = diamonds) +
stat_summary(
mapping = aes(x = cut, y = depth),
fun.ymin = min,
fun.ymax = max,
fun.y = median
)
#modified
ggplot(data = diamonds, aes(x = cut, y = depth)) +
geom_pointrange(stat = "summary",
fun.ymin = min,
fun.ymax = max,
fun.y = median)
geom_col() is the equivalent of geom_bar(stat = "identity").
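To illustrate the geom_col() equivalence, here is a sketch that pre-computes the bar heights and then draws them both ways. The `cut_counts` table and the plot names are my own; the comparison uses layer_data() to pull out the heights each layer draws.

```r
library(ggplot2)
library(dplyr)

# Pre-computed bar heights: one row per cut, with its count in n
cut_counts <- count(diamonds, cut)

p_col <- ggplot(cut_counts) +
  geom_col(aes(x = cut, y = n))
p_identity <- ggplot(cut_counts) +
  geom_bar(aes(x = cut, y = n), stat = "identity")

# Both layers draw bars of the same height
heights_col <- layer_data(p_col)$y
heights_identity <- layer_data(p_identity)$y
identical(heights_col, heights_identity)
```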
| geom | stat |
|---|---|
| geom_bar() | stat_count() |
| geom_bin2d() | stat_bin_2d() |
| geom_boxplot() | stat_boxplot() |
| geom_contour() | stat_contour() |
| geom_count() | stat_sum() |
| geom_density() | stat_density() |
| geom_density_2d() | stat_density_2d() |
| geom_hex() | stat_bin_hex() |
| geom_freqpoly() | stat_bin() |
| geom_histogram() | stat_bin() |
| geom_qq_line() | stat_qq_line() |
| geom_qq() | stat_qq() |
| geom_quantile() | stat_quantile() |
| geom_smooth() | stat_smooth() |
| geom_violin() | stat_ydensity() |
| geom_sf() | stat_sf() |
Many (but not all) have similar names.
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, y = ..prop..))
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, fill = color, y = ..prop..))
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = cut, y = ..prop.., fill = color, group = 1))
Without group = 1, the proportions are calculated within each bar's own group, so every bar is always presented as 100%. To get the “best” visualisation:
ggplot(data = diamonds) +
geom_bar(aes(x = cut, y = ..count.. / sum(..count..), fill = color))
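The effect of group = 1 on the proportions is easy to verify from the computed layer data. A sketch, using the same ..prop.. notation as the plots above (newer ggplot2 releases prefer after_stat(prop) and will warn about the dot-dot form); `prop_default` and `prop_grouped` are my own names:

```r
library(ggplot2)

# Default grouping: each cut is its own group, so every prop is 1
p_default <- ggplot(diamonds) +
  geom_bar(aes(x = cut, y = ..prop..))

# group = 1: all cuts share one group, so props are shares of the whole dataset
p_grouped <- ggplot(diamonds) +
  geom_bar(aes(x = cut, y = ..prop.., group = 1))

prop_default <- layer_data(p_default)$prop
prop_grouped <- layer_data(p_grouped)$prop

prop_default       # all bars have the same height
sum(prop_grouped)  # the bars together make up the whole dataset
```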
What is the problem with this plot? How could you improve it?
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) + geom_point()
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
geom_jitter(height = 1, width = 1, alpha = 0.6)
The points overlap. To address: use jitter and alpha.
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
geom_count()
geom_count() plots the number of observations at each location as the size of the point, instead of moving the points.
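To get a feel for how bad the overplotting is, we can count how many cars share each (cty, hwy) position, which is essentially what geom_count() maps to point size. A sketch (`overlaps` is my own name):

```r
library(ggplot2)  # provides the mpg dataset
library(dplyr)

# Number of cars sharing each (cty, hwy) coordinate
overlaps <- count(mpg, cty, hwy)

nrow(mpg)       # total observations
nrow(overlaps)  # distinct plotted positions

# The most crowded positions
head(arrange(overlaps, desc(n)))
```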
mpg %>% ggplot(aes(x = as.factor(cyl), y = hwy, colour = fl)) + geom_boxplot()
ggplot(data = diamonds) +
geom_bar(mapping = aes(x = 1, fill = clarity)) + coord_polar(theta = "y")
Specify labels! x, y axes, title etc!
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
geom_point() +
geom_abline() +
coord_fixed()